Lecture 4.1 - Comparing Groups
Comparing Groups
Setting Expectations
Today we are working with a dataset on wages in the U.S. collected by the US Department of Labor: us.dol.wages.csv
As before, we will first assume that the dataset approximately represents all workers in the U.S., so statistics computed from the full dataset (means and proportions) can serve as stand-ins for the true population parameters.
The data columns in this dataset are:
exp: Years of experience
wks: Weeks worked per year
bluecol: Is the job a blue collar job
ind: Works in the manufacturing industry
south: Is the person working in the American South
married: Is the person married
sex: What is the sex of the person
union: Is the person in a union
ed: Number of years of education
black: Is the person Black
lwage: Log of wages per week
wage: Wage of the person per week
First, form two hypotheses about the data based on categorical variable subgroups with respect to the response variable wage. You might think that union members earn more than non-union members, for example, or that males earn more than females.
You can hypothesize two subgroups are the same on some characteristic or different, but the hypothesis must be meaningful and informative.
Write out two hypotheses about the data, stating the null hypothesis and alternative hypothesis and write a justification for your hypothesis.
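As an illustration of the notation only, the union example above could be written as follows. The symbols \(\mu_{\text{union}}\) and \(\mu_{\text{non-union}}\) are assumed names for the population mean weekly wage of each group. A one-sided alternative is shown because the stated belief is directional; a two-sided alternative (\(\neq\)) is the more cautious choice when you are unsure of the direction.

```latex
\[
H_0: \mu_{\text{union}} - \mu_{\text{non-union}} = 0
\qquad
H_A: \mu_{\text{union}} - \mu_{\text{non-union}} > 0
\]
```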
Then, make a general statement about whether you expect the differences to be substantively large or small. Do you expect men to earn a little more than women or a lot more, for example?
Note how large you expect the difference between the categories to be.
Based on a sample size of 100 and the nature of the hypotheses, select an appropriate alpha value.
Create a sample of size 100. Remember, you can do this with the following command:
us.dol.sample <- us.dol.wages %>%
slice_sample(n=100, replace=TRUE)
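Because the sample is random, every run of the command above gives different rows. If you want your by-hand numbers and your later R results to refer to the same sample, one option is to fix the random seed first. The sketch below uses a simulated stand-in data frame (wages_demo, not the real data) and an arbitrary seed value so that it runs on its own.

```r
library(dplyr)

# Simulated stand-in for us.dol.wages (NOT the real data).
set.seed(1)
wages_demo <- data.frame(
  married = sample(c("yes", "no"), 500, replace = TRUE),
  wage    = round(rlnorm(500, meanlog = 6.7, sdlog = 0.4))
)

# Fixing the seed right before sampling makes the draw repeatable.
set.seed(2024)  # arbitrary seed value
sample_a <- wages_demo %>%
  slice_sample(n = 100, replace = TRUE)

# Same seed again, so the same rows are drawn.
set.seed(2024)
sample_b <- wages_demo %>%
  slice_sample(n = 100, replace = TRUE)

nrow(sample_a)
identical(sample_a$wage, sample_b$wage)
```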
Summarizing the Data
Make a simple table for each level of your categorical variables with respect to the response variable (wage).
You can get the information by filtering the data and summarizing it as follows:
us.dol.sample %>%
filter(married=="no") %>%
summarize(mean(wage))
Also make a second table with summary data (mean, standard deviation, range) for your response variable (wage).
Write a few sentences interpreting your summary statistics. Make sure your tables with your summary statistics provide appropriate information with respect to your hypotheses. Does the difference between the subgroups on the variables of interest seem large or small compared to the range of your response variable?
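As an alternative to filtering one subgroup at a time, a grouped summary can produce both tables in one pass. This is a minimal sketch on a simulated stand-in for us.dol.sample (demo_sample is not the real data, and the "yes"/"no" coding is assumed to match the married column); the column names in the output are illustrative.

```r
library(dplyr)

# Simulated stand-in for us.dol.sample (NOT the real data).
set.seed(42)
demo_sample <- data.frame(
  married = sample(c("yes", "no"), 100, replace = TRUE),
  wage    = round(rlnorm(100, meanlog = 6.7, sdlog = 0.4))
)

# One row per subgroup: size, mean, standard deviation, and range of wage.
group_summary <- demo_sample %>%
  group_by(married) %>%
  summarize(
    n         = n(),
    mean_wage = mean(wage),
    sd_wage   = sd(wage),
    min_wage  = min(wage),
    max_wage  = max(wage)
  )

# The same statistics for the whole sample, to compare against the group gap.
overall_summary <- demo_sample %>%
  summarize(
    n         = n(),
    mean_wage = mean(wage),
    sd_wage   = sd(wage),
    min_wage  = min(wage),
    max_wage  = max(wage)
  )

group_summary
overall_summary
```

Comparing the gap between the two mean_wage values with the overall range (max_wage minus min_wage) gives a quick sense of whether the difference is substantively large.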
Check the conditions for a \(t\) test – does the data support using a \(t\) test?
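One way to look at the conditions is to check the size of each group and whether wages within a group are strongly skewed or contain extreme outliers. The sketch below is a rough check, not a formal test, and runs on a simulated stand-in for us.dol.sample (demo_sample is not the real data; the "yes"/"no" coding of union is an assumption).

```r
# Simulated stand-in for us.dol.sample (NOT the real data).
set.seed(42)
demo_sample <- data.frame(
  union = sample(c("yes", "no"), 100, replace = TRUE),
  wage  = round(rlnorm(100, meanlog = 6.7, sdlog = 0.4))
)

# Group sizes: with small groups, skew and outliers matter more.
group_n <- table(demo_sample$union)
group_n

# A mean well above the median within a group suggests right skew.
group_mean   <- tapply(demo_sample$wage, demo_sample$union, mean)
group_median <- tapply(demo_sample$wage, demo_sample$union, median)
rbind(mean = group_mean, median = group_median)

# Visual checks, drawn only in an interactive session.
if (interactive()) {
  boxplot(wage ~ union, data = demo_sample, main = "Wage by union status")
  qqnorm(demo_sample$wage[demo_sample$union == "yes"])
  qqline(demo_sample$wage[demo_sample$union == "yes"])
}
```

Independence of observations cannot be read off the data in this way; it depends on how the sample was drawn.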
Analyze
Calculate the \(p\) value of the difference to test your hypotheses by hand; do not use R’s \(t\) test function (we will examine that in the second half of class). To choose the appropriate degrees of freedom, you can use the textbook shortcut: one less than the smaller of the two sample sizes, \(df=\min(n_1, n_2)-1\).
Remember, the formula for the \(t\) statistic is: \(t=\frac{\bar{y}_1-\bar{y}_2}{SE}\), where \(SE=\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\).
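The by-hand steps can be sketched in R using only arithmetic and the \(t\) distribution functions pt() and qt(), without calling t.test(). The two vectors below are simulated stand-ins for the wages of two subgroups (not the real data), alpha = 0.05 is an assumed significance level, a two-sided alternative is assumed, and the degrees of freedom use the shortcut of one less than the smaller group size.

```r
# Simulated stand-ins for the wages of two subgroups (NOT the real data).
set.seed(42)
group1 <- round(rlnorm(40, meanlog = 6.8, sdlog = 0.4))  # e.g. union members
group2 <- round(rlnorm(60, meanlog = 6.7, sdlog = 0.4))  # e.g. non-members

alpha <- 0.05  # assumed significance level

n1 <- length(group1)
n2 <- length(group2)
diff_means <- mean(group1) - mean(group2)

# Standard error of the difference: sqrt(s1^2 / n1 + s2^2 / n2).
se <- sqrt(var(group1) / n1 + var(group2) / n2)

t_stat <- diff_means / se
df     <- min(n1, n2) - 1  # shortcut degrees of freedom

# Two-sided p value from the t distribution.
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Confidence interval for the difference in means at level 1 - alpha.
t_crit <- qt(1 - alpha / 2, df = df)
ci <- c(lower = diff_means - t_crit * se, upper = diff_means + t_crit * se)

c(difference = diff_means, t = t_stat, df = df, p = p_value)
ci
```

For a one-sided alternative, drop the factor of 2 and use the tail that matches the direction of your hypothesis.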
Make a table with five columns: the first two columns are the mean wage of the two groups (for example, black/white), the third is the size of the difference, the fourth is the confidence interval for the difference, and the fifth is the \(p\) value for testing whether the difference is statistically significantly different from zero.
Now let’s use R to conduct \(t\) tests on the data. The general form of a \(t\) test in R is:
t.test(groupa, groupb)
for a two-group comparison, or
t.test(group, mu=<your hypothesis>)
for a one-group comparison, where mu is your null hypothesis. Find the equivalent piece of information for each cell in your table. How close were your results calculated by hand to the result calculated by R?
Hint: to quickly create a subset, you can use the following command:
females.sample <- subset(us.dol.sample, sex=="female")
males.sample <- subset(us.dol.sample, sex=="male")
t.test(females.sample$wage, males.sample$wage)
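The object returned by t.test() stores each piece of the printed output as a named element, which makes it easy to fill in the table. The sketch below runs on a simulated stand-in for us.dol.sample (demo_sample is not the real data); the results_table column names are illustrative, and the confidence interval is split into a lower and an upper column.

```r
# Simulated stand-in for us.dol.sample (NOT the real data).
set.seed(42)
demo_sample <- data.frame(
  sex  = sample(c("female", "male"), 100, replace = TRUE),
  wage = round(rlnorm(100, meanlog = 6.7, sdlog = 0.4))
)

females <- subset(demo_sample, sex == "female")
males   <- subset(demo_sample, sex == "male")

result <- t.test(females$wage, males$wage)

result$estimate   # the two group means
result$conf.int   # confidence interval for the difference in means
result$p.value    # two-sided p value
result$parameter  # degrees of freedom (Welch approximation)
result$statistic  # t statistic

# One-row table matching the columns asked for in the exercise.
results_table <- data.frame(
  mean_group1 = unname(result$estimate[1]),
  mean_group2 = unname(result$estimate[2]),
  difference  = unname(result$estimate[1] - result$estimate[2]),
  ci_lower    = result$conf.int[1],
  ci_upper    = result$conf.int[2],
  p_value     = result$p.value
)
results_table
```

By default t.test() uses the Welch approximation for the degrees of freedom rather than the smaller-group shortcut, so the \(t\) statistic should match your by-hand value while the \(p\) value and confidence interval will usually differ slightly.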
Now find the “true” population difference from the entire dataset. Did the confidence interval for your difference cover the population difference?
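The coverage check amounts to computing the same difference in means on every row of the full dataset and asking whether it falls inside the interval from the sample. The sketch below uses a simulated stand-in for the full dataset (demo_population is not the real data, and its size of 4000 rows is arbitrary).

```r
# Simulated stand-in for the full dataset us.dol.wages (NOT the real data).
set.seed(42)
demo_population <- data.frame(
  sex  = sample(c("female", "male"), 4000, replace = TRUE),
  wage = round(rlnorm(4000, meanlog = 6.7, sdlog = 0.4))
)

# Sample of 100 rows drawn with replacement, mirroring the exercise.
sample_rows <- sample(nrow(demo_population), 100, replace = TRUE)
demo_sample <- demo_population[sample_rows, ]

# "True" difference in mean wage (female minus male), using every row.
pop_means <- tapply(demo_population$wage, demo_population$sex, mean)
pop_diff  <- unname(pop_means["female"] - pop_means["male"])

# Confidence interval for the same difference, from the sample only.
# With the formula interface the groups are ordered alphabetically,
# so this is also female minus male.
ci <- t.test(wage ~ sex, data = demo_sample)$conf.int

# TRUE if the interval contains the population difference.
covered <- ci[1] <= pop_diff && pop_diff <= ci[2]
covered
```

Make sure the population difference and the confidence interval subtract the groups in the same order; otherwise the signs will not match.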
Interpret
Make some comments regarding whether the results surprised you or confirmed what you thought about the world.
Also make some notes regarding whether the differences in means that you found were substantively significant.
Write up a profile of an ‘average’ person in this dataset and describe which features of this person are associated with earning more or less than someone in the opposite demographic categories.
Finally, note whether you think there are any other factors we should consider when evaluating your hypotheses, or whether you think there are any problems with just using a \(t\) test to evaluate the difference.